The impact of language dynamics on the capitalization of broadcast news

Authors

  • Fernando Batista
  • Nuno J. Mamede
  • Isabel Trancoso
Abstract

This paper investigates the impact of language dynamics on the capitalization of broadcast news transcriptions. Most of the capitalization information is provided by a large newspaper corpus. Three speech corpora subsets, from different time periods, are used for evaluation, assessing the importance of having training data from nearby time periods. Results are provided for both manual and automatic transcriptions, also showing the impact of recognition errors on the capitalization task. Our approach is based on maximum entropy models, uses unlimited vocabulary, and is suitable for language adaptation. The language model for a given time period is produced by retraining a previous language model with data from that time period. The language model produced with this approach can be sorted and then pruned to reduce computational resources, without much impact on the final results.
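
The retrain-then-prune idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary simplification (capitalized vs. lowercase), uses a logistic model as the two-class case of maximum entropy, and every function name, feature template, and example sentence below is hypothetical.

```python
import math
from collections import defaultdict

def features(tokens, i):
    """Hypothetical sparse features for token i: the word and its left neighbour."""
    prev = tokens[i - 1] if i > 0 else "<s>"
    return [f"w={tokens[i]}", f"prev={prev}"]

def predict(weights, feats):
    """Probability that a token is capitalized (two-class maximum entropy)."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, prior=None, epochs=20, lr=0.5):
    """SGD on log-loss; `prior` warm-starts from an earlier period's weights."""
    weights = defaultdict(float, prior or {})
    for _ in range(epochs):
        for feats, label in examples:
            err = label - predict(weights, feats)
            for f in feats:
                weights[f] += lr * err
    return dict(weights)

def prune(weights, keep):
    """Sort features by absolute weight and keep only the `keep` largest."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:keep])

def make_examples(sentence):
    """Derive (features, label) pairs from a correctly capitalized sentence."""
    tokens = [t.lower() for t in sentence]
    return [(features(tokens, i), int(sentence[i][0].isupper()))
            for i in range(len(sentence))]

# Assumed toy data: an "older" period and a "newer" period with a new name.
old = train(make_examples("the minister met Lisbon officials".split()))
new = train(make_examples("the minister met Obama in Lisbon".split()), prior=old)
small = prune(new, keep=6)
```

Here `new` starts from the weights of `old` rather than from scratch, which is one simple way to read "retraining a previous language model", and `small` keeps only the six features with the largest absolute weight. The learning rate, epoch count, and pruning size are arbitrary choices for the illustration.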

Similar resources

Recovering Capitalization and Punctuation Marks on Speech Transcriptions

This work addresses two metadata annotation tasks, involved in the production of rich transcripts: automatic capitalization, and punctuation marks recovery. The main focus concerns broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared, and results support the idea that generative approaches capture the structure of writte...

Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news

The following material presents a study about recovering punctuation marks, and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using: finite state transducers automatically built from language models; and maximum entropy models. Several resources were used, includi...

A Study on News Anchors’ Meta-Language and Non-Verbal Factors and their Impact on Audiences

Non-verbal communication, or body messaging, occurs when facial expressions, tone of voice, head and neck movements, smiling and ... affect others, whether intentionally or unintentionally. Farhangi, in "Nonverbal Communication: The Art of Using Movement and Sound", defines this field as follows: "Non-verbal communication is phonetic and non-phonetic messages which have been explained by other than l...

Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages

This paper shows experimental results concerning automatic enrichment of the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach. The approach is language independent as reinforced by experiments performed on Portuguese and Spanish Broadcast News corpora. The discrimi...

The Impact of Corporate income Tax and Firm Size on Fixed Investment

This paper analyzes the impact of income taxes and market capitalization on fixed investment (investment in tangible assets) by manufacturing companies listed on KSE. It examines how corporate income taxes affect fixed investment by reducing the cash flow available for a firm to invest, and how firm size, in light of market capitalization, affects fixed ...

Publication date: 2008